Discourse Automatic Annotation of Texts: An Application to Summarization

Authors

  • Antoine Blais
  • Iana Atanassova
  • Jean-Pierre Desclés
  • Mimi Zhang
  • Leila Zighem
Abstract

The exploitation of the discourse structure of a text and the identification of discourse categories are essential for automatic summarization as well as for textual information retrieval. In this paper we describe an automatic summarization strategy that uses these elements as the basis for extracting the most relevant textual segments, which then constitute the summary. Certain linguistic markers allow us to annotate a text automatically according to discourse categories, making the discourse structure and the discourse categories of the text visible. Our approach is domain independent, and the discourse categories that we use for summarization are general across natural languages. This makes it possible to apply our method to articles in various domains and in different languages.

Introduction to Automatic Summarization

We present below the two main approaches to automatically building the summary of a text (for more details, see Mani 2001).

The first approach is automatic summary production by comprehension. Originating from the domain of Artificial Intelligence, this approach considers automatic summarization to be similar, to some extent, to human summarization, and bases it on the partial or total comprehension of the text. The program must be able to build a representation of the text, which may possibly be modified, in order to generate a summary from it. However, this method is quite difficult to carry out, as it requires automatic text comprehension, text representation, and automatic text generation. The existing methods for these tasks are still quite unsatisfactory.

The second approach is automatic summarization by extraction, which is inspired by the domain of Information Retrieval. The goal of this approach is to quickly provide a simple informative summary without making a deep analysis of the text.
In this method, we search for and extract the most relevant textual segments (often sentences and paragraphs) in order to constitute an extract that we consider to be the summary. The central procedure consists in evaluating the relevance of textual segments according to one or more criteria. There are two major methods for doing this.

The first is the statistical method. It uses numerical methods to measure the relevance of a given text segment according to the presence of certain terms that are representative of the text (using a frequency calculation). Different heuristic criteria, such as the position in the textual structure or the presence of terms from the title, can also be used.

Other methods, which rely more on linguistic knowledge, use the presence of surface linguistic markers to establish the relevance of a textual segment. Certain linguistic markers allow us to attribute a semantic (discourse or rhetorical) value to a textual segment, according to a linguistic theory, and thus to determine its relevance to the summary.

Among the advantages of the extraction method is that it neither makes a deep analysis of the text nor uses text generation. Instead, it produces a summary using simple extraction algorithms and does not rely on any kind of text comprehension. The disadvantages of this method are often attributed to the lack of coherence of the summary and to the fact that the broken interconnections between the juxtaposed textual segments could change the interpretation of the text. Nevertheless, this approach remains the only one that is currently feasible from the point of view of computer implementation.

At present, automatic summarization is increasingly oriented towards the production of flexible summaries that correspond to specific user needs. That is why summarization strategies should allow the production of different summaries of the same text according to the needs of the user.
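The two families of relevance criteria described above (statistical and marker-based) can be illustrated with a minimal Python sketch. It is not the authors' system: the stop-word list, the marker lexicon, the category names, the weights, and all function names are hypothetical choices made only for illustration. Each sentence receives a score built from average term frequency, overlap with the title, a bonus for the first position, and a bonus for each discourse category signalled by a surface marker; the top-scoring sentences are then returned in their original order.

```python
import re
from collections import Counter

# Hypothetical resources, for illustration only (not taken from the paper).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
              "we", "this", "that", "for", "on", "it"}
MARKERS = {
    "conclusion": ("in conclusion", "we conclude"),
    "result": ("results show", "we found"),
    "objective": ("in this paper", "the aim of"),
}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def annotate(sentence):
    """Return the discourse categories whose surface markers occur in the sentence."""
    lowered = sentence.lower()
    return [cat for cat, cues in MARKERS.items()
            if any(cue in lowered for cue in cues)]

def score_sentences(title, sentences):
    """Score each sentence with frequency, title, position and marker criteria."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    title_terms = set(tokenize(title))
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            scores.append(0.0)
            continue
        avg_freq = sum(freq[w] for w in words) / len(words)  # statistical criterion
        title_bonus = len(title_terms & set(words))          # terms shared with the title
        position_bonus = 1.0 if i == 0 else 0.0              # first sentence of the text
        marker_bonus = 2.0 * len(annotate(sentence))         # assumed weight per category
        scores.append(avg_freq + title_bonus + position_bonus + marker_bonus)
    return scores

def summarize(title, sentences, k=2):
    """Return the k best-scoring sentences, kept in their original order."""
    scores = score_sentences(title, sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

Changing the weights or the marker lexicon changes which segments are extracted, which is one simple way to obtain different summaries of the same text for different user needs.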
Furthermore, automatic summarization tends to be increasingly integrated with other similar applications that rely on common text processing methods.

The LaLICC laboratory of the University of Paris-Sorbonne (Paris IV) has been working for several years in the domain of automatic summarization. The realization of different projects, such as SERAPHIN (Berri 1995), SAFIR (Berri et al. 1996) and ContextO (Crispino et al. 2003), has led to some discussion and the production of ...

Copyright © 2007, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Related resources

A Cross-Lingual Approach to the Discourse Automatic Annotation: Application to French and Bulgarian

In this paper we propose a cross-lingual approach to the discourse automatic annotation of scientific articles by the Contextual Exploration method. We present an application to French and Bulgarian as an illustration of the possibility to work with different languages that the method provides. We describe the methodology for the construction of the linguistic resources for Bulgarian based on t...

Detecting the central units in two different genres and languages: a preliminary study of Brazilian Portuguese and Basque texts / Detección de la unidad central en dos géneros y lenguajes diferentes: un estudio preliminar en portugués brasileño y euskera

The aim of this paper is to present the development of a rule-based automatic detector which determines the main idea or the most pertinent discourse unit in two different languages such as Basque and Brazilian Portuguese and in two distinct genres such as scientific abstracts and argumentative answers. The central unit (CU) may be of interest to understand texts regarding relational discourse ...

Subtopic annotation and automatic segmentation for news texts in Brazilian Portuguese

Subtopic segmentation aims to break documents into subtopical text passages, which develop a main topic in a text. Being capable of automatically detecting subtopics is very useful for several Natural Language Processing applications. For instance, in automatic summarisation, having the subtopics at hand enables the production of summaries with good subtopic coverage. Given the usefulness of su...

Cunha towards discourse parsing in Spanish

Texts can be analysed from different perspectives. One of the most difficult phenomena to process is discourse structure (Hovy 2010). In recent years, one of the main challenges in the field of Natural Language Processing (NLP) has been discourse parsing. Research on this topic has been done for several languages, such as Japanese (Sumita et al. 1992), English (Marcu 2000) and Portuguese (Pardo...

Mining Discourse Markers For Chinese Textual Summarization

Discourse markers foreshadow the message thrust of texts and saliently guide their rhetorical structure which are important for content filtering and text abstraction. This paper reports on efforts to automatically identify and classify discourse markers in Chinese texts using heuristic-based and corpus-based data-mining methods, as an integral part of automatic text summarization via rhetorica...

Enhancement Of A Chinese Discourse Marker Tagger With C4.5

Discourse markers are complex discontinuous linguistic expressions which are used to explicitly signal the discourse structure of a text. This paper describes efforts to improve an automatic tagging system which identifies and classifies discourse markers in Chinese texts by applying machine learning (ML) to the disambiguation of discourse markers, as an integral part of automatic text summariz...

Publication date: 2007